Benford's Law

Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small (Arno Berger and Theodore P. Hill, "Benford's Law Strikes Back: No Simple Explanation in Sight for Mathematical Gem", 2011).
In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on. The graph to the right shows Benford's law for base 10, one of infinitely many cases of a generalized law regarding numbers expressed in arbitrary (integer) bases, which rules out the possibility that the phenomenon might be an artifact of the base-10 number system. Further generalizations published in 1995 included analogous statements for both the ''n''th leading digit and the joint distribution of the leading ''n'' digits, the latter of which leads to a corollary wherein the significant digits are shown to be a statistically dependent quantity. It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, and physical and mathematical constants. Like other general principles about natural data (for example, the fact that many data sets are well approximated by a normal distribution), there are illustrative examples and explanations that cover many of the cases where Benford's law applies, though many other cases resist simple explanation. Benford's law tends to be most accurate when values are distributed across multiple orders of magnitude, especially if the process generating the numbers is described by a power law (which is common in nature). The law is named after physicist Frank Benford, who stated it in 1938 in an article titled "The Law of Anomalous Numbers", although it had been previously stated by Simon Newcomb in 1881. The law is similar in concept, though not identical in distribution, to Zipf's law.


Definition

A set of numbers is said to satisfy Benford's law if the leading digit ''d'' (''d'' ∈ {1, ..., 9}) occurs with probability

: P(d) = \log_{10}(d + 1) - \log_{10}(d) = \log_{10}\left(\frac{d + 1}{d}\right) = \log_{10}\left(1 + \frac{1}{d}\right).

The leading digits in such a set thus have the following distribution:

  ''d''    P(''d'')
  1      30.1%
  2      17.6%
  3      12.5%
  4       9.7%
  5       7.9%
  6       6.7%
  7       5.8%
  8       5.1%
  9       4.6%

The quantity P(''d'') is proportional to the space between ''d'' and ''d'' + 1 on a logarithmic scale. Therefore, this is the distribution expected if the ''logarithms'' of the numbers (but not the numbers themselves) are uniformly and randomly distributed. For example, a number ''x'', constrained to lie between 1 and 10, starts with the digit 1 if 1 ≤ ''x'' < 2, and starts with the digit 9 if 9 ≤ ''x'' < 10. Therefore, ''x'' starts with the digit 1 if log 1 ≤ log ''x'' < log 2, or starts with 9 if log 9 ≤ log ''x'' < log 10. The interval [log 1, log 2] is much wider than the interval [log 9, log 10] (0.30 and 0.05 respectively); therefore if log ''x'' is uniformly and randomly distributed, it is much more likely to fall into the wider interval than the narrower interval, i.e. more likely to start with 1 than with 9; the probabilities are proportional to the interval widths, giving the equation above (as well as the generalization to other bases besides decimal). Benford's law is sometimes stated in a stronger form, asserting that the fractional part of the logarithm of data is typically close to uniformly distributed between 0 and 1; from this, the main claim about the distribution of first digits can be derived.
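The first-digit probabilities in the definition can be evaluated directly. The following is a minimal Python sketch, not taken from the article: the function names `benford_probability` and `leading_digit` are illustrative, and reading the leading digit from the string form of a number is one convenient approach among several.

```python
import math


def benford_probability(d: int) -> float:
    """Probability that the leading decimal digit equals d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be in 1..9")
    return math.log10(1 + 1 / d)


def leading_digit(x: float) -> int:
    """First significant decimal digit of a non-zero finite number."""
    if x == 0:
        raise ValueError("zero has no leading digit")
    # str() writes mantissa digits before any exponent, so the first
    # character in 1..9 is the leading significant digit.
    return int(next(ch for ch in str(abs(x)) if ch in "123456789"))


# Print the expected share of each leading digit.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

The nine probabilities sum to 1 because the logarithms telescope to log 10 − log 1.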


In other bases

An extension of Benford's law predicts the distribution of first digits in other bases besides decimal; in fact, any base ''b'' ≥ 2. The general form is

: P(d) = \log_b(d + 1) - \log_b(d) = \log_b\left(1 + \frac{1}{d}\right).

For ''b'' = 2 and ''b'' = 1 (the binary and unary number systems), Benford's law is true but trivial: all binary and unary numbers (except for 0 or the empty set) start with the digit 1. (On the other hand, the generalization of Benford's law to second and later digits is not trivial, even for binary numbers.)
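The general-base formula can be sketched the same way. This is an illustrative Python snippet (the function name and its `base` parameter are assumptions, not part of the article); it covers integer bases of 2 or more, since a base-1 logarithm is undefined and the unary case has to be handled by inspection.

```python
import math


def benford_probability(d: int, base: int = 10) -> float:
    """Probability of leading digit d for numbers written in the given base."""
    if base < 2:
        raise ValueError("base must be an integer >= 2")
    if not 1 <= d < base:
        raise ValueError("leading digit must be in 1..base-1")
    return math.log(1 + 1 / d, base)


# In binary the only possible leading digit is 1, so its probability is 1.
print(benford_probability(1, base=2))

# In hexadecimal the fifteen possible leading digits share the probability.
for d in range(1, 16):
    print(format(d, "x"), f"{benford_probability(d, base=16):.3f}")
```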


Examples

Examining a list of the heights of the 58 tallest structures in the world by category shows that 1 is by far the most common leading digit, ''irrespective of the unit of measurement'' (see "scale invariance" below). Another example is the leading digit of 2^''n''. The sequence of the first 96 leading digits (1, 2, 4, 8, 1, 3, 6, 1, 2, 5, 1, 2, 4, 8, 1, 3, 6, 1, ...) exhibits closer adherence to Benford's law than is expected for random sequences of the same length, because it is derived from a geometric sequence.
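The powers-of-2 example is easy to reproduce. The sketch below assumes the 96 terms are 2^0 through 2^95, which matches the digits quoted above; the variable names are illustrative.

```python
from collections import Counter


def leading_digit_of_power_of_two(n: int) -> int:
    """Leading decimal digit of 2**n, using exact integer arithmetic."""
    return int(str(2 ** n)[0])


# Leading digits of 2**0 .. 2**95 (96 terms).
digits = [leading_digit_of_power_of_two(n) for n in range(96)]
counts = Counter(digits)

# Tabulate how often each digit leads, as a count and as a share.
for d in range(1, 10):
    print(d, counts[d], f"{counts[d] / len(digits):.1%}")
```

Python integers have arbitrary precision, so no rounding is involved in reading off the first digit.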


History

The discovery of Benford's law goes back to 1881, when the Canadian-American astronomer Simon Newcomb noticed that in logarithm tables the earlier pages (that started with 1) were much more worn than the other pages. Newcomb's published result is the first known instance of this observation and includes a distribution on the second digit as well. Newcomb proposed a law that the probability of a single number ''N'' being the first digit of a number was equal to log(''N'' + 1) − log(''N''). The phenomenon was again noted in 1938 by the physicist Frank Benford, who tested it on data from 20 different domains and was credited for it. His data set included the surface areas of 335 rivers, the sizes of 3259 US populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an issue of ''Reader's Digest'', the street addresses of the first 342 persons listed in ''American Men of Science'' and 418 death rates. The total number of observations used in the paper was 20,229. This discovery was later named after Benford (making it an example of Stigler's law). In 1995, Ted Hill proved the result about mixed distributions mentioned below.


Explanations

Benford's law tends to apply most accurately to data that span several orders of magnitude. As a rule of thumb, the more orders of magnitude that the data evenly covers, the more accurately Benford's law applies. For instance, one can expect that Benford's law would apply to a list of numbers representing the populations of UK settlements. But if a "settlement" is defined as a village with population between 300 and 999, then Benford's law will not apply. Consider the probability distributions shown below, referenced to a log scale. In each case, the total area in red is the relative probability that the first digit is 1, and the total area in blue is the relative probability that the first digit is 8. For the first distribution, the sizes of the red and blue areas are approximately proportional to the widths of each red and blue bar. Therefore, the numbers drawn from this distribution will approximately follow Benford's law. On the other hand, for the second distribution, the ratio of the areas of red and blue is very different from the ratio of the widths of each red and blue bar. Rather, the relative areas of red and blue are determined more by the heights of the bars than the widths. Accordingly, the first digits in this distribution do not satisfy Benford's law at all. Thus, real-world distributions that span several orders of magnitude rather uniformly (e.g., stock-market prices and populations of villages, towns, and cities) are likely to satisfy Benford's law very accurately. On the other hand, a distribution mostly or entirely within one order of magnitude (e.g., IQ scores or heights of human adults) is unlikely to satisfy Benford's law very accurately, if at all. However, the difference between applicable and inapplicable regimes is not a sharp cut-off: as the distribution gets narrower, the deviations from Benford's law increase gradually. (This discussion is not a full explanation of Benford's law, because it has not explained why data sets are so often encountered that, when plotted as a probability distribution of the logarithm of the variable, are relatively uniform over several orders of magnitude. Arno Berger and Theodore P. Hill, in "Benford's Law Strikes Back: No Simple Explanation in Sight for Mathematical Gem" (2011), describe this argument but say it "still leaves open the question of why it is reasonable to assume that the logarithm of the spread, as opposed to the spread itself – or, say, the log log spread – should be large" and that "assuming large spread on a logarithmic scale is ''equivalent'' to assuming an approximate conformance with [Benford's law]" (italics added), something which they say lacks a "simple explanation".)
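The contrast between a broad and a narrow distribution can be illustrated with a small simulation. This is a rough Python sketch under stated assumptions: the broad sample is log-uniform over six orders of magnitude, the narrow sample is a normal distribution with mean 170 and standard deviation 10 (a stand-in for adult heights in centimetres), and the sample sizes, seed, and function names are arbitrary choices, not values from the article.

```python
import math
import random
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant decimal digit of a non-zero finite number."""
    # str() writes mantissa digits before any exponent, so the first
    # character in 1..9 is the leading significant digit.
    return int(next(ch for ch in str(abs(x)) if ch in "123456789"))


def first_digit_shares(values):
    """Map each digit 1..9 to its observed share among the leading digits."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


rng = random.Random(0)  # fixed seed so the run is repeatable

# Broad: logarithm uniform on [0, 6), i.e. values between 1 and 10**6.
wide = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
# Narrow: almost every value falls between 100 and 200.
narrow = [rng.gauss(170, 10) for _ in range(100_000)]

wide_shares = first_digit_shares(wide)
narrow_shares = first_digit_shares(narrow)

# Columns: digit, Benford expectation, broad sample, narrow sample.
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(d, f"{expected:.3f}", f"{wide_shares[d]:.3f}", f"{narrow_shares[d]:.3f}")
```

The broad sample should track the Benford column closely, while the narrow sample is dominated by the digit 1 for reasons unrelated to the law.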


Krieger–Kafri entropy explanation

In 1970 Wolfgang Krieger proved what is now called the Krieger generator theorem. The Krieger generator theorem might be viewed as a justification for the assumption in the Kafri ball-and-box model that, in a given base B with a fixed number of digits 0, 1, ..., ''n'', ..., B - 1, digit ''n'' is equivalent to a Kafri box containing ''n'' non-interacting balls. A number of other scientists and statisticians have suggested entropy-related explanations for Benford's law.


Multiplicative fluctuations

Many real-world examples of Benford's law arise from multiplicative fluctuations. For example, if a stock price starts at $100, and then each day it gets multiplied by a randomly chosen factor between 0.99 and 1.01, then over an extended period the probability distribution of its price satisfies Benford's law with higher and higher accuracy. The reason is that the ''logarithm'' of the stock price is undergoing a random walk, so over time its probability distribution will get more and more broad and smooth (see above). (More technically, the central limit theorem says that multiplying more and more random variables will create a log-normal distribution with larger and larger variance, so eventually it covers many orders of magnitude almost uniformly.) To be sure of approximate agreement with Benford's law, the distribution has to be approximately invariant when scaled up by any factor up to 10; a log-normally distributed data set with wide dispersion would have this approximate property. Unlike multiplicative fluctuations, ''additive'' fluctuations do not lead to Benford's law: they lead instead to normal probability distributions (again by the central limit theorem), which do not satisfy Benford's law. By contrast, that hypothetical stock price described above can be written as the ''product'' of many random variables (i.e. the price change factor for each day), so is ''likely'' to follow Benford's law quite well.
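The stock-price argument can be tried out numerically. The Python sketch below is an illustration only: to keep the run short it assumes daily factors between 0.9 and 1.1 rather than the 0.99 to 1.01 of the example above (narrower factors need far more steps before the logarithm spreads over an order of magnitude), and the number of paths, number of steps, seed, and function names are all arbitrary.

```python
import math
import random
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant decimal digit of a non-zero finite number."""
    return int(next(ch for ch in str(abs(x)) if ch in "123456789"))


def simulate_price(rng, steps, start=100.0, low=0.9, high=1.1):
    """Multiply a starting price by `steps` independent uniform random factors."""
    price = start
    for _ in range(steps):
        price *= rng.uniform(low, high)
    return price


rng = random.Random(1)  # fixed seed so the run is repeatable

# Final prices of 2000 independent paths after 1000 multiplicative steps each.
prices = [simulate_price(rng, steps=1000) for _ in range(2000)]

counts = Counter(leading_digit(p) for p in prices)
shares = {d: counts[d] / len(prices) for d in range(1, 10)}

# Columns: digit, Benford expectation, simulated share.
for d in range(1, 10):
    print(d, f"{math.log10(1 + 1 / d):.3f}", f"{shares[d]:.3f}")
```

With these settings the base-10 logarithm of the final price has a standard deviation of roughly 0.8, which is already wide enough for the first-digit shares to sit close to the Benford values, up to sampling noise.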


Multiple probability distributions

Anton Formann provided an alternative explanation by directing attention to the interrelation between the distribution of the significant digits and the distribution of the observed variable. He showed in a simulation study that long-right-tailed distributions of a random variable are compatible with the Newcomb–Benford law, and that for distributions of the ratio of two random variables the fit generally improves. For numbers drawn from certain distributions (IQ scores, human heights) Benford's law fails to hold because these variates obey a normal distribution, which is known not to satisfy Benford's law, since normal distributions can't span several orders of magnitude and the mantissae of their logarithms will not be (even approximately) uniformly distributed. However, if one "mixes" numbers from those distributions, for example, by taking numbers from newspaper articles, Benford's law reappears. This can also be proven mathematically: if one repeatedly "randomly" chooses a probability distribution (from an uncorrelated set) and then randomly chooses a number according to that distribution, the resulting list of numbers will obey Benford's law. A similar probabilistic explanation for the appearance of Benford's law in everyday-life numbers has been advanced by showing that it arises naturally when one considers mixtures of uniform distributions.


Invariance

In a list of lengths, the distribution of first digits of numbers in the list may be generally similar regardless of whether all the lengths are expressed in metres, yards, feet, inches, etc. The same applies to monetary units. This is not always the case. For example, the height of adult humans almost always starts with a 1 or 2 when measured in metres and almost always starts with 4, 5, 6, or 7 when measured in feet. But in a list of lengths spread evenly over many orders of magnitude—for example, a list of 1000 lengths mentioned in scientific papers that includes the measurements of molecules, bacteria, plants, and galaxies—it is reasonable to expect the distribution of first digits to be the same no matter whether the lengths are written in metres or in feet. When the distribution of the first digits of a data set is scale-invariant (independent of the units that the data are expressed in), it is always given by Benford's law. For example, the first (non-zero) digit on the aforementioned list of lengths should have the same distribution whether the unit of measurement is feet or yards. But there are three feet in a yard, so the probability that the first digit of a length in yards is 1 must be the same as the probability that the first digit of a length in feet is 3, 4, or 5; similarly, the probability that the first digit of a length in yards is 2 must be the same as the probability that the first digit of a length in feet is 6, 7, or 8. Applying this to all possible measurement scales gives the logarithmic distribution of Benford's law. Benford's law for first digits is base invariant for number systems. There are conditions and proofs of sum invariance, inverse invariance, and addition and subtraction invariance.
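The feet-and-yards argument can be checked against the logarithmic formula. As a short worked derivation (the subscripted labels are notation introduced here for illustration), the base-10 Benford probabilities satisfy both identities exactly, because the logarithms telescope:

```latex
\begin{align*}
P_{\text{feet}}(3) + P_{\text{feet}}(4) + P_{\text{feet}}(5)
  &= \log_{10}\frac{4}{3} + \log_{10}\frac{5}{4} + \log_{10}\frac{6}{5}
   = \log_{10}\frac{6}{3} = \log_{10} 2 = P_{\text{yards}}(1), \\
P_{\text{feet}}(6) + P_{\text{feet}}(7) + P_{\text{feet}}(8)
  &= \log_{10}\frac{7}{6} + \log_{10}\frac{8}{7} + \log_{10}\frac{9}{8}
   = \log_{10}\frac{9}{6} = \log_{10}\frac{3}{2} = P_{\text{yards}}(2).
\end{align*}
```

A uniform first-digit distribution would fail this check, since 3/9 is not equal to 1/9.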


Applications


Accounting fraud detection

In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who fabricate figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford's law ought to show up any anomalous results.


Use in criminal trials

In the United States, evidence based on Benford's law has been admitted in criminal cases at the federal, state, and local levels.


Election data

Walter Mebane, a political scientist and statistician at the University of Michigan, was the first to apply the second-digit Benford's law-test (2BL-test) in election forensics. Such analysis is considered a simple, though not foolproof, method of identifying irregularities in election results. Scientific consensus to support the applicability of Benford's law to elections has not been reached in the literature. A 2011 study by the political scientists Joseph Deckert, Mikhail Myagkov, and Peter C. Ordeshook argued that Benford's law is problematic and misleading as a statistical indicator of election fraud. Their method was criticized by Mebane in a response, though he agreed that there are many caveats to the application of Benford's law to election data.

Benford's law has been used as evidence of fraud in the 2009 Iranian elections. An analysis by Mebane found that the second digits in vote counts for President Mahmoud Ahmadinejad, the winner of the election, tended to differ significantly from the expectations of Benford's law, and that the ballot boxes with very few invalid ballots had a greater influence on the results, suggesting widespread ballot stuffing. Another study used bootstrap simulations to find that the candidate Mehdi Karroubi received almost twice as many vote counts beginning with the digit 7 as would be expected according to Benford's law, while an analysis from Columbia University concluded that the probability that a fair election would produce both too few non-adjacent digits and the suspicious deviations in last-digit frequencies as found in the 2009 Iranian presidential election is less than 0.5 percent.

Benford's law has also been applied for forensic auditing and fraud detection on data from the 2003 California gubernatorial election, the 2000 and 2004 United States presidential elections (Walter R. Mebane, Jr., "Election Forensics: The Second-Digit Benford's Law Test and Recent American Presidential Elections" in ''Election Fraud: Detecting and Deterring Electoral Manipulation'', edited by R. Michael Alvarez et al. (Washington, D.C.: Brookings Institution Press, 2008), pp. 162–81), and the 2009 German federal election; the Benford's law test was found to be "worth taking seriously as a statistical test for fraud," although it "is not sensitive to distortions we know significantly affected many votes."

Benford's law has also been misapplied to claim election fraud. When applying the law to Joe Biden's election returns for Chicago, Milwaukee, and other localities in the 2020 United States presidential election, the distribution of the first digit did not follow Benford's law. The misapplication was a result of looking at data that was tightly bound in range, which violates the assumption inherent in Benford's law that the range of the data be large. The first digit test was applied to precinct-level data, but because precincts rarely receive more than a few thousand votes or fewer than several dozen, Benford's law cannot be expected to apply. According to Mebane, "It is widely understood that the first digits of precinct vote counts are not useful for trying to diagnose election frauds."


Macroeconomic data

Similarly, the macroeconomic data the Greek government reported to the European Union before entering the eurozone was shown to be probably fraudulent using Benford's law, albeit years after the country joined.


Price digit analysis

Benford's law as a benchmark for the investigation of price digits has been successfully introduced into the context of pricing research. The importance of this benchmark for detecting irregularities in prices was first demonstrated in a Europe-wide study which investigated consumer price digits before and after the euro introduction for price adjustments. The introduction of the euro in 2002, with its various exchange rates, distorted existing nominal price patterns while at the same time retaining real prices. While the first digits of nominal prices were distributed according to Benford's law, the study showed a clear deviation from this benchmark for the second and third digits in nominal market prices, with a clear trend towards psychological pricing after the nominal shock of the euro introduction.


Genome data

The number of open reading frames and their relationship to genome size differs between eukaryotes and prokaryotes, with the former showing a log-linear relationship and the latter a linear relationship. Benford's law has been used to test this observation with an excellent fit to the data in both cases.


Scientific fraud detection

A test of regression coefficients in published papers showed agreement with Benford's law. As a comparison group, subjects were asked to fabricate statistical estimates. The fabricated results conformed to Benford's law on first digits, but failed to obey Benford's law on second digits.


Statistical tests

Although the chi-squared test has been used to test for compliance with Benford's law, it has low statistical power when used with small samples. The Kolmogorov–Smirnov test and the Kuiper test are more powerful when the sample size is small, particularly when Stephens's corrective factor is used. These tests may be unduly conservative when applied to discrete distributions. Critical values of these test statistics for the Benford case have been generated by Morrow; they give the minimum test statistic value required to reject the hypothesis of compliance with Benford's law at a given significance level.

Two alternative tests specific to this law have been published. First, the max (''m'') statistic is given by
: m = \sqrt{N} \cdot \max_{i=1}^{9} \left\{ \left| \Pr(X \text{ has FSD} = i) - \log_{10}\left(1 + \frac{1}{i}\right) \right| \right\}.
The leading factor \sqrt{N} does not appear in the original formula by Leemis; it was added by Morrow in a later paper. Second, the distance (''d'') statistic is given by
: d = \sqrt{N \cdot \sum_{i=1}^{9} \left[ \Pr(X \text{ has FSD} = i) - \log_{10}\left(1 + \frac{1}{i}\right) \right]^2},
where FSD is the first significant digit and ''N'' is the sample size. Morrow has determined the critical values for both these statistics.

Morrow has also shown that for any random variable ''X'' (with a continuous probability density function) divided by its standard deviation (''σ''), some value ''A'' can be found so that the distribution of the first significant digit of the random variable |X/\sigma|^A will differ from Benford's law by less than ''ε'' > 0. The value of ''A'' depends on the value of ''ε'' and the distribution of the random variable.

A method of accounting fraud detection based on bootstrapping and regression has been proposed. If the goal is to conclude agreement with Benford's law rather than disagreement, then the goodness-of-fit tests mentioned above are inappropriate. In this case, specific tests for equivalence should be applied. An empirical distribution is called equivalent to Benford's law if a distance (for example, the total variation distance or the usual Euclidean distance) between the probability mass functions is sufficiently small. This method of testing, with application to Benford's law, is described in Ostrovski.
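As an illustration, the max (''m'') and distance (''d'') statistics described in this section can be computed from a sample as sketched below. The sketch assumes the forms m = sqrt(N) · max |p_i − log10(1 + 1/i)| and d = sqrt(N · Σ (p_i − log10(1 + 1/i))²), where p_i is the observed proportion of first digit i. The helper names (`first_digit`, `digit_proportions`, `max_statistic`, `distance_statistic`) and the two sample data sets are hypothetical, and no critical values are included.

```python
import math
from collections import Counter

# Benford probabilities log10(1 + 1/i) for first digits i = 1..9.
BENFORD = {i: math.log10(1 + 1 / i) for i in range(1, 10)}


def first_digit(x):
    """First significant digit of a nonzero number (hypothetical helper)."""
    # In scientific notation the first significant digit is the first character.
    return int(f"{abs(x):.15e}"[0])


def digit_proportions(values):
    """Observed proportion of each first digit 1..9 in the sample."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return {i: counts[i] / n for i in range(1, 10)}


def max_statistic(values):
    """m = sqrt(N) * max_i |observed_i - log10(1 + 1/i)| (assumed form)."""
    p = digit_proportions(values)
    return math.sqrt(len(values)) * max(abs(p[i] - BENFORD[i]) for i in BENFORD)


def distance_statistic(values):
    """d = sqrt(N * sum_i (observed_i - log10(1 + 1/i))^2) (assumed form)."""
    p = digit_proportions(values)
    return math.sqrt(len(values) * sum((p[i] - BENFORD[i]) ** 2 for i in BENFORD))


# Illustrative samples: powers of 2 (close to Benford) and evenly spread digits.
powers_of_two = [2 ** k for k in range(1, 501)]
uniform_digits = list(range(1, 10)) * 100

print(max_statistic(powers_of_two), distance_statistic(powers_of_two))
print(max_statistic(uniform_digits), distance_statistic(uniform_digits))
```

Both statistics grow with the sample size for a fixed discrepancy, so a computed value is only meaningful when compared against critical values such as those determined by Morrow.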


Range of applicability


Distributions known to obey Benford's law

Some well-known infinite integer sequences satisfy Benford's law exactly (in the asymptotic limit as more and more terms of the sequence are included). Among these are the Fibonacci numbers, the factorials, the powers of 2, and the powers of ''almost'' any other number. In general, the sequence ''k''^1, ''k''^2, ''k''^3, etc., satisfies Benford's law exactly, under the condition that log_10 ''k'' is an irrational number. This is a straightforward consequence of the equidistribution theorem. That the first 100 powers of 2 approximately satisfy Benford's law is mentioned by Ralph Raimi.

Likewise, some continuous processes satisfy Benford's law exactly (in the asymptotic limit as the process continues through time). One is an exponential growth or decay process: if a quantity is exponentially increasing or decreasing in time, then the percentage of time that it has each first digit satisfies Benford's law asymptotically (i.e. with increasing accuracy as the process continues through time).
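The behaviour of powers can be checked numerically. The sketch below tallies the leading digits of the first 1000 powers of 2 and compares them with log10(1 + 1/d); it also shows the powers of 10, one of the exceptions implied by "almost any other number", since log_10 10 is rational. The function name `leading_digit_frequencies` and the number of terms are illustrative choices, not taken from the sources cited here.

```python
import math
from collections import Counter


def leading_digit_frequencies(base, terms):
    """Relative frequency of each leading digit among base**1 .. base**terms."""
    counts = Counter(int(str(base ** k)[0]) for k in range(1, terms + 1))
    return {d: counts[d] / terms for d in range(1, 10)}


benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
powers_of_2 = leading_digit_frequencies(2, 1000)
powers_of_10 = leading_digit_frequencies(10, 50)

# Print observed frequency next to the Benford prediction for each digit.
for d in range(1, 10):
    print(d, round(powers_of_2[d], 3), round(benford[d], 3))

# Every power of 10 starts with the digit 1, so the law fails for this base.
print(powers_of_10[1])
```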


Distributions known to disobey Benford's law

The square roots and reciprocals of successive natural numbers do not obey this law. Prime numbers in a finite range follow a generalized Benford's law that approaches uniformity as the size of the range approaches infinity. Lists of local telephone numbers violate Benford's law. Benford's law is violated by the populations of all places with a population of at least 2500 individuals from five US states according to the 1960 and 1970 censuses, where only 19% began with digit 1 but 20% began with digit 2, because truncation at 2500 introduces statistical bias. The terminal digits in pathology reports violate Benford's law due to rounding. Distributions that do not span several orders of magnitude will not follow Benford's law. Examples include height, weight, and IQ scores. (Singleton, Tommie W. (May 1, 2011). "Understanding and Applying Benford's Law", ''ISACA Journal'', Information Systems Audit and Control Association. Retrieved Nov. 9, 2020.)


Criteria for distributions expected and not expected to obey Benford's law

A number of criteria, applicable particularly to accounting data, have been suggested where Benford's law can be expected to apply.
;Distributions that can be expected to obey Benford's law
* When the mean is greater than the median and the skew is positive
* Numbers that result from mathematical combination of numbers: e.g. quantity × price
* Transaction level data: e.g. disbursements, sales
;Distributions that would not be expected to obey Benford's law
* Where numbers are assigned sequentially: e.g. check numbers, invoice numbers
* Where numbers are influenced by human thought: e.g. prices set by psychological thresholds ($9.99)
* Accounts with a large number of firm-specific numbers: e.g. accounts set up to record $100 refunds
* Accounts with a built-in minimum or maximum
* Distributions that do not span an order of magnitude of numbers.
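Some of the criteria above (mean greater than median, positive skew, span of at least an order of magnitude) can be checked mechanically before any digit test is run. The following is a rough screening sketch only; the function name `benford_screen`, the returned keys, and the example data are hypothetical, and passing the screen does not establish that a data set follows Benford's law.

```python
import math
import statistics


def benford_screen(values):
    """Rough screen based on the listed criteria; not a formal test."""
    magnitudes = [abs(v) for v in values if v != 0]
    mean = statistics.fmean(magnitudes)
    median = statistics.median(magnitudes)
    # The sign of the third central moment gives the direction of the skew.
    third_moment = sum((v - mean) ** 3 for v in magnitudes) / len(magnitudes)
    orders = math.log10(max(magnitudes) / min(magnitudes))
    return {
        "mean_above_median": mean > median,
        "positive_skew": third_moment > 0,
        "orders_of_magnitude": orders,
        "spans_order_of_magnitude": orders >= 1,
    }


# Illustrative data: quantity x price totals versus adult heights in cm.
invoice_totals = [q * p for q in (1, 2, 5, 20, 100) for p in (0.5, 3.0, 12.0, 80.0)]
heights_cm = list(range(150, 191))

print(benford_screen(invoice_totals))
print(benford_screen(heights_cm))
```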


Benford’s law compliance theorem

Mathematically, Benford's law applies if the distribution being tested fits the "Benford's law compliance theorem". The derivation says that Benford's law is followed if the Fourier transform of the probability density function of the logarithm of the variable is zero at all nonzero integer frequencies. Most notably, this is satisfied if the Fourier transform is zero (or negligible) for ''n'' ≥ 1. This is satisfied if the distribution is wide (since a wide distribution implies a narrow Fourier transform). Smith summarizes thus (p. 716):
Benford's law is followed by distributions that are wide compared with unit distance along the logarithmic scale. Likewise, the law is not followed by distributions that are narrow compared with unit distance … If the distribution is wide compared with unit distance on the log axis, it means that the spread in the set of numbers being examined is much greater than ten.
In short, Benford’s law requires that the numbers in the distribution being measured have a spread across at least an order of magnitude.
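The effect of width on the logarithmic scale can be illustrated by simulation. The sketch below draws log-normal samples whose log10 has a chosen standard deviation (the width in orders of magnitude) and reports the largest gap between observed first-digit frequencies and Benford's law. The function name `max_deviation`, the centre `mu_log10 = 0.5`, the sample size, and the seed are all assumptions made for this example.

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def max_deviation(sigma_log10, mu_log10=0.5, n=20000, seed=0):
    """Largest |observed - Benford| first-digit gap for a log-normal sample.

    sigma_log10 is the standard deviation of log10(X), i.e. the width of
    the distribution measured in orders of magnitude.
    """
    rng = random.Random(seed)
    counts = [0] * 9
    for _ in range(n):
        log_x = rng.gauss(mu_log10, sigma_log10)
        # The fractional part of log10(X) determines the first digit.
        mantissa = 10 ** (log_x - math.floor(log_x))
        counts[min(int(mantissa), 9) - 1] += 1
    return max(abs(c / n - p) for c, p in zip(counts, BENFORD))


for width in (0.05, 0.3, 1.0, 2.0):
    print(width, round(max_deviation(width), 4))
```

Under these assumptions the narrow sample, which is concentrated near a single leading digit, deviates strongly, while the sample spread over several orders of magnitude stays within sampling noise of the law.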


Tests with common distributions

Benford's law was empirically tested against the numbers (up to the 10th digit) generated by a number of important distributions, including the uniform distribution, the exponential distribution, the normal distribution, and others. The uniform distribution, as might be expected, does not obey Benford's law. In contrast, the ratio distribution of two uniform distributions is well described by Benford's law. Neither the normal distribution nor the ratio distribution of two normal distributions (the Cauchy distribution) obeys Benford's law. Although the half-normal distribution does not obey Benford's law, the ratio distribution of two half-normal distributions does. Neither the right-truncated normal distribution nor the ratio distribution of two right-truncated normal distributions is well described by Benford's law. This is not surprising, as this distribution is weighted towards larger numbers. Benford's law also describes the exponential distribution and the ratio distribution of two exponential distributions well.

The fit of the chi-squared distribution depends on the degrees of freedom (df), with good agreement at df = 1 and decreasing agreement as the df increases. The ''F''-distribution is fitted well for low degrees of freedom. With increasing df the fit decreases, but much more slowly than for the chi-squared distribution. The fit of the log-normal distribution depends on the mean and the variance of the distribution. The variance has a much greater effect on the fit than does the mean. Larger values of both parameters result in better agreement with the law. The ratio of two log-normal distributions is a log-normal, so this distribution was not examined.

Other distributions that have been examined include the Muth distribution, Gompertz distribution, Weibull distribution, gamma distribution, log-logistic distribution and the exponential power distribution, all of which show reasonable agreement with the law. The Gumbel distribution, whose density increases with increasing value of the random variable, does not show agreement with this law.
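A few of these comparisons can be reproduced in miniature. The sketch below is not the procedure used in the studies summarized above; it simply draws samples from a standard uniform distribution, a ratio of two standard uniforms, and a unit-rate exponential distribution, then reports the largest first-digit deviation from Benford's law. The parameters, sample size, seed, and helper names are assumptions for illustration.

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """First significant digit of a nonzero number (hypothetical helper)."""
    return int(f"{abs(x):.15e}"[0])


def max_deviation(sample):
    """Largest |observed - Benford| gap over the nine first digits."""
    nonzero = [x for x in sample if x != 0]
    counts = [0] * 9
    for x in nonzero:
        counts[first_digit(x) - 1] += 1
    return max(abs(c / len(nonzero) - p) for c, p in zip(counts, BENFORD))


rng = random.Random(12345)
n = 20000
# 1 - random() lies in (0, 1], so no uniform draw is exactly zero.
uniform = [1 - rng.random() for _ in range(n)]
uniform_ratio = [(1 - rng.random()) / (1 - rng.random()) for _ in range(n)]
exponential = [rng.expovariate(1.0) for _ in range(n)]

print("uniform      ", round(max_deviation(uniform), 3))
print("uniform ratio", round(max_deviation(uniform_ratio), 3))
print("exponential  ", round(max_deviation(exponential), 3))
```

With these settings the uniform sample shows a much larger gap than the other two, in line with the qualitative ordering described above; the ratio and exponential samples are close to the law but not exact.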


Generalization to digits beyond the first

It is possible to extend the law to digits beyond the first. In particular, for any given number of digits, the probability of encountering a number starting with the string of digits ''n'' of that length, discarding leading zeros, is given by
: \log_{10}(n + 1) - \log_{10}(n) = \log_{10}\left(1 + \frac{1}{n}\right).
For example, the probability that a number starts with the digits 3, 1, 4 is \log_{10}\left(1 + \frac{1}{314}\right) \approx 0.00138, as in the figure on the right. Numbers satisfying this include 3.14159..., 314285.7... and 0.00314465... .

This result can be used to find the probability that a particular digit occurs at a given position within a number. For instance, the probability that a "2" is encountered as the second digit is
: \log_{10}\left(1 + \frac{1}{12}\right) + \log_{10}\left(1 + \frac{1}{22}\right) + \cdots + \log_{10}\left(1 + \frac{1}{92}\right) \approx 0.109.
And the probability that ''d'' (''d'' = 0, 1, ..., 9) is encountered as the ''n''-th (''n'' > 1) digit is
: \sum_{k=10^{n-2}}^{10^{n-1}-1} \log_{10}\left(1 + \frac{1}{10k + d}\right).
The distribution of the ''n''-th digit, as ''n'' increases, rapidly approaches a uniform distribution with 10% for each of the ten digits, as shown below. Four digits is often enough to assume a uniform distribution of 10%, as "0" appears 10.0176% of the time in the fourth digit, while "9" appears 9.9824% of the time.
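The digit-position probabilities discussed above can be evaluated directly. The sketch assumes the prefix probability log10(1 + 1/n) and, for position n > 1, the sum of log10(1 + 1/(10k + d)) over k from 10^(n−2) to 10^(n−1) − 1; the function names are hypothetical.

```python
import math


def prefix_probability(n):
    """Probability that the significant digits start with the integer n."""
    return math.log10(1 + 1 / n)


def nth_digit_probability(d, position):
    """Probability that digit d appears at the given position (1 = leading)."""
    if position == 1:
        # A leading significant digit is never zero.
        return prefix_probability(d) if d >= 1 else 0.0
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)
    # Sum over every possible block of digits preceding position `position`.
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, hi))


print(prefix_probability(314))        # starts with 3, 1, 4
print(nth_digit_probability(2, 2))    # "2" as the second digit
print(nth_digit_probability(0, 4), nth_digit_probability(9, 4))
```

Summing `nth_digit_probability(d, n)` over d = 0, ..., 9 gives 1 for any fixed position, because the logarithms telescope.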


Moments

Average and moments of random variables for the digits 1 to 9 following this law have been calculated (Scott, P.D.; Fasli, M. (2001). "Benford's Law: An empirical investigation and a novel explanation". ''CSM Technical Report'' 349, Department of Computer Science, Univ. Essex):
* mean 3.440
* variance 6.057
* skewness 0.796
* kurtosis −0.548
For the two-digit distribution according to Benford's law these values are also known:
* mean 38.590
* variance 621.832
* skewness 0.772
* kurtosis −0.547
A table of the exact probabilities for the joint occurrence of the first two digits according to Benford's law is available, as is the population correlation between the first and second digits.
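These moments follow from the probabilities log10(1 + 1/m) and can be recomputed as a check. The sketch below assumes that "kurtosis" here means excess kurtosis (fourth standardized moment minus 3), which is consistent with the negative values quoted; the function name `benford_moments` is hypothetical.

```python
import math


def benford_moments(lo, hi):
    """Mean, variance, skewness and excess kurtosis of leading digits lo..hi."""
    probs = {m: math.log10(1 + 1 / m) for m in range(lo, hi + 1)}
    mean = sum(m * p for m, p in probs.items())

    def central(r):
        # r-th central moment of the leading-digit distribution.
        return sum((m - mean) ** r * p for m, p in probs.items())

    var = central(2)
    return mean, var, central(3) / var ** 1.5, central(4) / var ** 2 - 3


first_digit_moments = benford_moments(1, 9)     # single leading digit
two_digit_moments = benford_moments(10, 99)     # first two digits together

print([round(v, 3) for v in first_digit_moments])
print([round(v, 3) for v in two_digit_moments])
```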


In popular culture

Benford's law has appeared as a plot device in some twenty-first century popular entertainment.
* Television crime drama ''NUMB3RS'' used Benford's law in the 2006 episode "The Running Man" to help solve a series of high burglaries.
* The 2016 film ''The Accountant'' relied on Benford's law to expose theft of funds from a robotics company.
* The 2017 Netflix series ''Ozark'' used Benford's law to analyze a cartel member's financial statements and uncover fraud.
* The 2021 Jeremy Robinson novel ''Infinite 2'' applied Benford's law to test whether the characters are in a simulation or reality.


See also

* Fraud detection in predictive analytics
* Zipf's law


Further reading

* Alex Ely Kossovsky. ''Benford's Law: Theory, the General Law of Relative Quantities, and Forensic Fraud Detection Applications'', 2014, World Scientific Publishing.


External links


* Benford Online Bibliography, an online bibliographic database on Benford's law.
* Testing Benford's Law, an open source project showing Benford's law in action against publicly available datasets.